Abstract: A variational principle for the rate distortion (RD) 
theory with Bregman divergences is formulated within the ambit 
of the generalized (nonextensive) statistics of Tsallis. The Tsallis- 
Bregman RD lower bound is established. Alternate minimization 
schemes for the generalized Bregman RD (GBRD) theory are 
derived. A computational strategy to implement the GBRD model 
is presented. The efficacy of the GBRD model is exemplified with 
the aid of numerical simulations. 

I. Introduction 

The generalized (nonextensive) statistics of Tsallis [1,2] has 
recently been the focus of much attention in statistical physics 
and allied disciplines¹. Nonextensive statistics generalizes 
the extensive Boltzmann-Gibbs statistics, and has found much 
utility in complex systems possessing long range correlations, 
fluctuations, ergodicity, chirality and fractal behavior. The 
Tsallis entropy is defined in terms of discrete variables as 



$$S_q(x) = -\frac{1 - \sum_x p^q(x)}{1 - q}, \quad \text{where} \quad \sum_x p(x) = 1. \tag{1}$$



The constant q is referred to as the nonextensivity param- 
eter. Given two independent variables x and y, one of the 
fundamental consequences of nonextensivity is demonstrated 
by the pseudo-additivity relation 



$$S_q(xy) = S_q(x) + S_q(y) + (1 - q)\, S_q(x)\, S_q(y). \tag{2}$$



Here, (1) and (2) imply that extensive statistics is recovered as 
$q \to 1$. Taking the limit $q \to 1$ in (1) and invoking l'Hôpital's 
rule, $S_q(x) \to S(x)$, the Shannon entropy. The jointly convex 
generalized Kullback-Leibler divergence (K-Ld) is of the form 
[3] 



$$I_q\left(p(x) \,\|\, r(x)\right) = \sum_x p(x)\, \frac{\left( \frac{p(x)}{r(x)} \right)^{q-1} - 1}{q - 1}. \tag{3}$$

In the limit $q \to 1$, the extensive K-Ld is readily recovered. 
Akin to the Tsallis entropy, the generalized K-Ld obeys the 
pseudo-additivity relation [3]. 
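The definitions (1) and (3) and the pseudo-additivity relation (2) can be checked numerically. The following is a minimal sketch, not part of the original work; the function names and the test distributions are illustrative assumptions.

```python
import math

def tsallis_entropy(p, q):
    """Tsallis entropy (1): S_q = -(1 - sum p^q) / (1 - q); Shannon entropy (nats) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return -(1.0 - sum(pi ** q for pi in p if pi > 0)) / (1.0 - q)

def generalized_kld(p, r, q):
    """Generalized K-Ld (3): sum p * ((p / r)^(q - 1) - 1) / (q - 1)."""
    if abs(q - 1.0) < 1e-12:
        return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)
    return sum(pi * ((pi / ri) ** (q - 1.0) - 1.0) / (q - 1.0)
               for pi, ri in zip(p, r) if pi > 0)

def product(px, py):
    """Joint distribution of two independent variables."""
    return [a * b for a in px for b in py]

# Illustrative (assumed) values, with q in the range 0 < q < 1
q = 0.7
px, py = [0.2, 0.8], [0.5, 0.3, 0.2]

# Left- and right-hand sides of the pseudo-additivity relation (2)
lhs = tsallis_entropy(product(px, py), q)
rhs = (tsallis_entropy(px, q) + tsallis_entropy(py, q)
       + (1 - q) * tsallis_entropy(px, q) * tsallis_entropy(py, q))
print(lhs, rhs)

# The q -> 1 limit approaches the Shannon entropy
print(tsallis_entropy(px, 0.999999), tsallis_entropy(px, 1.0))
```

The two printed pairs should agree to numerical precision, the second only approximately since q is close to but not equal to 1.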

¹ A continually updated bibliography of works related to nonextensive 
statistics may be found at http://tsallis.cat.cbpf.br/biblio.htm. 

Defining the q-deformed logarithm and the q-deformed 
exponential as [4] 

$$\ln_q(x) = \frac{x^{1-q} - 1}{1 - q}, \tag{4}$$

and 

$$\exp_q(x) = \begin{cases} \left[ 1 + (1 - q)\, x \right]^{\frac{1}{1-q}}, & 1 + (1 - q)\, x \geq 0, \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$

respectively. The Tsallis entropy (1) and the generalized K-Ld (3) 
may be written as 



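$$S_q\left(p(x)\right) = -\sum_x p(x)^q \ln_q p(x), \tag{6}$$

and, 

$$I_q\left(p(x) \,\|\, r(x)\right) = -\sum_x p(x) \ln_q\left( \frac{r(x)}{p(x)} \right), \tag{7}$$

respectively. Employing the relation [4] 

$$\ln_q\left( \frac{x}{y} \right) = y^{q-1} \left( \ln_q x - \ln_q y \right), \tag{8}$$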



the generalized K-Ld (7) may be expressed as the generalized 
mutual information 



$$I_q(X; \hat{X}) = -\sum_{x,\hat{x}} p(x,\hat{x}) \ln_q\left( \frac{p(x)\, p(\hat{x})}{p(x,\hat{x})} \right) \tag{9}$$



(S q (X\X=x)} 



Here, $\langle \cdot \rangle$ denotes the expectation value. The seminal paper on 
source coding within the framework of nonextensive statistics 
by Landsberg and Vedral [5], has provided the impetus for 
a number of investigations into the use of nonextensive in- 
formation theory within the context of coding problems. The 
works of Yamano [6,7] represent a sample of some of the 
prominent efforts in this regard. The source coding theorem in 
a nonextensive setting has been derived by Yamano [8]. 
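The q-deformed logarithm and exponential, together with the quotient relation (8) and the equivalence of the entropy forms (1) and (6), can be sketched as follows. This is an illustrative check under assumed test values, not code from the original work.

```python
import math

def ln_q(x, q):
    """q-deformed logarithm; reduces to the natural logarithm at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-deformed exponential, set to zero beyond the cut-off."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

# Assumed test values
x, y, q = 2.0, 0.5, 0.6
p = [0.5, 0.3, 0.2]

# Relation (8): ln_q(x / y) = y^(q - 1) * (ln_q x - ln_q y)
quotient_lhs = ln_q(x / y, q)
quotient_rhs = y ** (q - 1.0) * (ln_q(x, q) - ln_q(y, q))

# Entropy written with the q-logarithm, (6), versus the direct form (1)
entropy_ln_q = -sum(pi ** q * ln_q(pi, q) for pi in p)
entropy_direct = -(1.0 - sum(pi ** q for pi in p)) / (1.0 - q)

print(quotient_lhs, quotient_rhs)
print(entropy_ln_q, entropy_direct)
```

Within the cut-off, `exp_q` inverts `ln_q`, which is the property used when (15) is rewritten as (16).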

A statistical physics model for the variational problem 
encountered in rate distortion (RD) theory [9, 10] and the 
information bottleneck method [11], derived within a general- 
ized statistics framework, has been recently established [12]. 
A nonextensive Blahut-Arimoto (BA) alternate minimization 
scheme [13] has been derived. The noteworthy result of the 
study in [12] is that the nonextensive RD curves possess a 
lower threshold for the minimum compression information in 
the distortion-compression plane, as compared to equivalent 
RD curves derived on the basis of the Boltzmann-Gibbs-Shannon 
framework. As was the case in [12], this paper 
employs values of q in the range $0 < q < 1$. This paper utilizes 
the statistical physics theory presented in [12] to formulate a 
generalized Bregman RD (GBRD) model. The GBRD model 
extends previous works by Rose [14] and Banerjee et al. [15, 
16] to achieve a principled and practical strategy to evaluate 
RD functions. 

II. Rationale for the GBRD model 

A. Background information 

The RD problem in terms of discrete random variables is 
stated as follows: given a discrete random variable $X \in \mathcal{X}$ 
called the source or the codebook, and another discrete random 
variable $\hat{X} \in \hat{\mathcal{X}}$ which is a compressed representation of 
X (also referred to as the quantized codebook and/or the reproduction 
alphabet), the information rate distortion function that 
is to be obtained is the minimization of the mutual information 
$I_q(X; \hat{X})$ over all conditional probabilities $p(\hat{x}|x)$. 

The crux of the RD problem is the numerical determina- 
tion of the RD function using the BA scheme. The actual 
implementation of the BA scheme is sometimes impractical 
owing to a lack of knowledge of the optimal support of the 
quantized codebook $\hat{X}$. Exact analytical solutions exist only 
for a few cases consisting of a combination of well-behaved 
sources and distortion measures. An initial attempt to achieve 
a practical and tractable solution to the RD problem was 
performed by Rose [14], for the case of Euclidean square 
distortion functions. Therein, it was demonstrated that for 
sources whose support is a bounded set, the RD function 
either equals the Shannon lower bound, or, the optimal support 
for the quantized codebook is finite thus permitting the use 
of a numerical procedure called the mapping method. The 
pioneering work of Rose was generalized by Banerjee et al. 
[15, 16] to include a wider class of distortion functions using 
Bregman divergences. One of the significant features in these 
works involved the formulation of a Shannon-Bregman lower 
bound. 

Motivated by the recent results [12], this paper provides 
two means to solve the GBRD problem for sources with 
bounded support. First, the analytical solution may be obtained 
from a Tsallis-Bregman lower bound (Section IV). Next, 
for a finite reproduction alphabet, the RD function may be 
numerically obtained by a computational methodology derived 
from generalized statistics (Section III). 

B. Bregman Divergences 

Definition 1 (Bregman divergences): Let $\phi$ be a real-valued 
strictly convex function defined on the convex set $\mathcal{S} \subseteq \mathbb{R}^m$, the 
domain of $\phi$, such that $\phi$ is differentiable on $\mathrm{int}(\mathcal{S})$, the interior 
of $\mathcal{S}$. The Bregman divergence $B_\phi : \mathcal{S} \times \mathrm{int}(\mathcal{S}) \to \mathbb{R}_+$ is 
defined as $B_\phi(z_1, z_2) = \phi(z_1) - \phi(z_2) - \langle z_1 - z_2, \nabla\phi(z_2) \rangle$, 
where $\nabla\phi(z_2)$ is the gradient of $\phi$ evaluated at $z_2$. 
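Definition 1 translates directly into code. The sketch below is illustrative only: the helper names are assumptions, and the two generating functions shown (squared Euclidean norm and negative Shannon entropy) are standard textbook examples rather than choices taken from this paper.

```python
import math

def bregman(phi, grad_phi, z1, z2):
    """B_phi(z1, z2) = phi(z1) - phi(z2) - <z1 - z2, grad phi(z2)>."""
    inner = sum((a - b) * g for a, b, g in zip(z1, z2, grad_phi(z2)))
    return phi(z1) - phi(z2) - inner

# phi(z) = ||z||^2 generates the squared Euclidean distance
def sq_norm(z):
    return sum(v * v for v in z)

def sq_norm_grad(z):
    return [2.0 * v for v in z]

# phi(z) = sum z log z generates the (unnormalized) Kullback-Leibler divergence
def neg_entropy(z):
    return sum(v * math.log(v) for v in z)

def neg_entropy_grad(z):
    return [math.log(v) + 1.0 for v in z]

d_euclid = bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 5.0])
d_kl = bregman(neg_entropy, neg_entropy_grad, [0.2, 0.8], [0.5, 0.5])
print(d_euclid)  # equals (1 - 3)^2 + (2 - 5)^2
print(d_kl)      # equals sum p log(p / r) for normalized p and r
```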



A number of Bregman divergences have been tabulated in 
[15]. The generalized K-Ld in (3), also referred to as the 
Csiszar generalized K-Ld, is not a Bregman divergence. Since 
the seminal work by Naudts [17], Bregman divergences have 
been the object of much research in nonextensive statistics. 
Defining $\phi(p) = -p \ln_q(1/p)$, the Bregman generalized K-Ld 
is defined as 

$$d_\phi(p, r) = \frac{1}{q-1} \sum_i p_i \left( p_i^{q-1} - r_i^{q-1} \right) - \sum_i \left( p_i - r_i \right) r_i^{q-1} = I_q^{B}(p \,\|\, r). \tag{10}$$

Setting $q - 1 \to \kappa$, (10) is consistent with Eq. (35) in [17]. 
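That (10) is the Bregman divergence generated by $\phi(p) = -p \ln_q(1/p)$ can be verified numerically by comparing the closed form against Definition 1 applied component-wise. The sketch below is an illustrative check with assumed function names and test distributions.

```python
def ln_q(x, q):
    """q-deformed logarithm for q != 1."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def phi(p, q):
    """Generating function phi(p) = -p ln_q(1 / p), summed over components."""
    return sum(-pi * ln_q(1.0 / pi, q) for pi in p)

def grad_phi(r, q):
    """Component-wise derivative: (1 - q r^(q - 1)) / (1 - q)."""
    return [(1.0 - q * ri ** (q - 1.0)) / (1.0 - q) for ri in r]

def bregman_kld_generic(p, r, q):
    """Definition 1 applied to phi."""
    g = grad_phi(r, q)
    return phi(p, q) - phi(r, q) - sum((pi - ri) * gi for pi, ri, gi in zip(p, r, g))

def bregman_kld_closed(p, r, q):
    """Closed form (10)."""
    first = sum(pi * (pi ** (q - 1.0) - ri ** (q - 1.0)) for pi, ri in zip(p, r)) / (q - 1.0)
    second = sum((pi - ri) * ri ** (q - 1.0) for pi, ri in zip(p, r))
    return first - second

# Assumed test values
q = 0.6
p, r = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
print(bregman_kld_generic(p, r, q), bregman_kld_closed(p, r, q))
```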

III. THE GBRD MODEL 

This Section provides a strategy to jointly obtain the optimal 
support $\hat{\mathcal{X}}_s$² of the quantized codebook with cardinality $|\hat{\mathcal{X}}_s| = k$, 
and the conditional probability $p(\hat{x}|x)$ that characterizes 
the RD problem. This is accomplished by a joint optimization 
of 

$$\min_{\hat{\mathcal{X}}_s,\, p(\hat{x}|x)} L_{GBRD} = \min_{\substack{\hat{\mathcal{X}}_s,\, p(\hat{x}|x) \\ |\hat{\mathcal{X}}_s| = k}} \left\{ I_q(X; \hat{X}) + \beta \left\langle d_\phi(x,\hat{x}) \right\rangle_{p(x,\hat{x})} \right\}. \tag{11}$$

The optimal Lagrange multiplier $\beta$, hereafter referred to as 
the inverse temperature, depends upon the optimal tolerance 
level of the expected distortion $D = \langle d_\phi(x,\hat{x}) \rangle_{p(x,\hat{x})} = \sum_{x,\hat{x}} p(x,\hat{x})\, d_\phi(x,\hat{x})$. 

A. Constraint terms 

Generalized statistics has utilized a number of constraints 
to define expectations. The original Tsallis (OT) constraints of 
the form $\langle A \rangle = \sum_i p_i A_i$ [1] were convenient owing to their 
similarity to the maximum entropy constraints. These were 
abandoned because of difficulties encountered in obtaining an 
acceptable form for the partition function, with the exception 
of a few specialized cases. 

The OT constraints were subsequently replaced by the 
Curado-Tsallis (C-T) constraints $\langle A \rangle_q^{C-T} = \sum_i p_i^q A_i$. The 
C-T constraints were later replaced by the normalized 
Tsallis-Mendes-Plastino (T-M-P) constraints $\langle A \rangle_q^{T-M-P} = \sum_i \frac{p_i^q}{\sum_i p_i^q} A_i$. 
The dependence of the expectation value on the 
normalization pdf $\sum_i p_i^q$ renders the T-M-P constraints 
self-referential. This paper, like [12], utilizes a recent 
methodology [18] to "rescue" the OT constraints, and has 
linked the OT, C-T, and T-M-P constraints. 
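The three forms of expectation differ only in how the probabilities are weighted. A minimal sketch, with assumed function names and test values, is:

```python
def expect_ot(p, a):
    """Original Tsallis (OT) constraint: sum_i p_i A_i."""
    return sum(pi * ai for pi, ai in zip(p, a))

def expect_ct(p, a, q):
    """Curado-Tsallis (C-T) constraint: sum_i p_i^q A_i."""
    return sum(pi ** q * ai for pi, ai in zip(p, a))

def expect_tmp(p, a, q):
    """Tsallis-Mendes-Plastino (T-M-P) constraint: C-T form divided by sum_i p_i^q."""
    return expect_ct(p, a, q) / sum(pi ** q for pi in p)

# Assumed test values
p = [0.5, 0.3, 0.2]
a = [1.0, 2.0, 3.0]
q = 0.7
print(expect_ot(p, a), expect_ct(p, a, q), expect_tmp(p, a, q))
```

At q = 1 the three expectations coincide. Only the T-M-P form returns a constant observable unchanged for q different from 1, at the price of the self-referential normalization noted above.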

B. The nonextensive variational principle 

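Keeping $\hat{\mathcal{X}}_s$ fixed, taking the variational derivative of (11) 
with respect to $p(\hat{x}|x)$ while enforcing $\sum_{\hat{x}} p(\hat{x}|x) = 1$ with 
the normalization Lagrange multiplier $\lambda(x)$, yields 

$$p(\hat{x}|x) = p(\hat{x}) \left[ \frac{1-q}{q} \left\{ \lambda(x) + \beta\, d_\phi(x,\hat{x}) \right\} \right]^{\frac{1}{q-1}}. \tag{12}$$

² Calligraphic symbols indicate sets. 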

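Multiplying the terms in the square brackets in (12) by the 
conditional probability $p(\hat{x}|x)$, and summing over $\hat{x}$ yields 

$$\frac{q}{q-1} \sum_{\hat{x}} p(\hat{x}) \left( \frac{p(\hat{x}|x)}{p(\hat{x})} \right)^q + \beta \left\langle d_\phi(x,\hat{x}) \right\rangle_{p(\hat{x}|X=x)} + \lambda(x) \sum_{\hat{x}} p(\hat{x}|x) = 0. \tag{13}$$

Invoking $\sum_{\hat{x}} p(\hat{x}|X=x) = 1$ yields 

$$\lambda(x) = \frac{q}{1-q}\, \aleph_q(x) - \beta \left\langle d_\phi(x,\hat{x}) \right\rangle_{p(\hat{x}|X=x)}. \tag{14}$$

Here, $\aleph_q(x) = \sum_{\hat{x}} p(\hat{x}) \left( \frac{p(\hat{x}|x)}{p(\hat{x})} \right)^q$. The conditional pdf 
$p(\hat{x}|x)$ acquires the form 

$$p(\hat{x}|x) = \frac{p(\hat{x}) \left\{ 1 - (q-1)\, \beta^* d_\phi(x,\hat{x}) \right\}^{\frac{1}{q-1}}}{\left( \Im_{RD}/q \right)^{\frac{1}{1-q}}}, \quad \Im_{RD} = q\, \aleph_q(x) + (q-1)\, \beta \left\langle d_\phi(x,\hat{x}) \right\rangle_{p(\hat{x}|X=x)}, \quad \beta^* = \frac{\beta}{\Im_{RD}}, \tag{15}$$

where $\beta^*$ is the effective inverse temperature. Transforming 
$q \to 2 - q^*$ in the numerator and invoking (5), (15) is expressed 
in the form of a q-deformed exponential 

$$p(\hat{x}|x) = \frac{p(\hat{x}) \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right)}{Z(x,\beta^*)}. \tag{16}$$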

The partition function evaluated at each instance of the source 
distribution is 

$$Z(x,\beta^*) = \left( \Im_{RD}/q \right)^{\frac{1}{1-q}} = \sum_{\hat{x}} p(\hat{x}) \exp_{q^*}\left[ -\beta^* d_\phi(x,\hat{x}) \right]. \tag{17}$$

The term $\{1 - (1 - q^*)\, \beta^* d_\phi(x,\hat{x})\}$ in the numerator of (16) 
is a manifestation of the Tsallis cut-off condition [18]. This 
implies that solutions of (16) are valid when $\beta^* d_\phi(x,\hat{x}) < 1/(1 - q^*)$. 
The effective nonextensive RD Helmholtz free energy is 

(18) 
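For a single source point, the conditional pdf (16) and the partition function (17) can be sketched as below. The function names, the squared Euclidean distortion and the numerical values are illustrative assumptions, not taken from the original implementation.

```python
def exp_q(x, q):
    """q-deformed exponential for q != 1, set to zero beyond the cut-off."""
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

def sq_euclid(a, b):
    """Squared Euclidean distortion, a Bregman divergence."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def conditional_pdf(x, codebook, prior, beta_star, q_star, d=sq_euclid):
    """Return p(x_hat | x) from (16) and the partition function Z(x, beta*) from (17)."""
    weights = [pj * exp_q(-beta_star * d(x, xh), q_star)
               for xh, pj in zip(codebook, prior)]
    z = sum(weights)
    return [w / z for w in weights], z

# Assumed values: q = 0.5, hence q* = 2 - q = 1.5
codebook = [(0.0, 0.0), (2.0, 2.0)]
prior = [0.5, 0.5]
cond, z = conditional_pdf((0.5, 0.0), codebook, prior, beta_star=1.0, q_star=1.5)
print(cond, z)
```

With equal priors, the codeword closer to the source point receives the larger conditional probability.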

Solution of (16) may be viewed from two distinct perspectives, 
i.e. the canonical perspective and the parametric 
perspective. Owing to the self-referential nature of the effective 
inverse temperature $\beta^*$, the analysis and solution of 
(16) within the context of the canonical perspective is a 
formidable undertaking. For practical applications, the parametric 
perspective is utilized by evaluating the conditional 
pdf $p(\hat{x}|x)$, employing the nonextensive BA algorithm, for 
a-priori specified $\beta^* \in [0, \infty]$. Note that within the context 
of the parametric perspective, the self-referential nature of $\beta^*$ 
vanishes. The inverse temperature $\beta$ and the effective inverse 
temperature $\beta^*$ relate as 



$$\beta = \frac{q\, \aleph_q(x)\, \beta^*}{1 - \beta^* (q-1) \left\langle d_\phi(x,\hat{x}) \right\rangle_{p(\hat{x}|X=x)}}. \tag{19}$$
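Relation (19) can be evaluated once the conditional pdf at a source point is known. The sketch below, with assumed function names and test values, computes $\beta$ from $\beta^*$ and inverts the relation via $\beta^* = \beta / (q \aleph_q(x) + (q-1) \beta \langle d_\phi \rangle)$ as a consistency check.

```python
def aleph_q(cond, prior, q):
    """aleph_q(x) = sum over x_hat of p(x_hat) * (p(x_hat | x) / p(x_hat))^q."""
    return sum(pj * (cj / pj) ** q for cj, pj in zip(cond, prior))

def mean_distortion(cond, dists):
    """Expected distortion under p(x_hat | X = x)."""
    return sum(cj * dj for cj, dj in zip(cond, dists))

def beta_from_beta_star(beta_star, cond, prior, dists, q):
    """Inverse temperature beta from the effective inverse temperature beta*, per (19)."""
    num = q * aleph_q(cond, prior, q) * beta_star
    den = 1.0 - beta_star * (q - 1.0) * mean_distortion(cond, dists)
    return num / den

def beta_star_from_beta(beta, cond, prior, dists, q):
    """Inverse of (19): beta* = beta / (q aleph_q + (q - 1) beta <d>)."""
    return beta / (q * aleph_q(cond, prior, q)
                   + (q - 1.0) * beta * mean_distortion(cond, dists))

# Assumed values for one source point and two codewords
cond, prior, dists, q = [0.7, 0.3], [0.5, 0.5], [0.2, 1.5], 0.5
beta = beta_from_beta_star(0.4, cond, prior, dists, q)
print(beta, beta_star_from_beta(beta, cond, prior, dists, q))
```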



C. Support estimation step 



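The procedure in Section III.B is carried out for a given $\beta$, 
corresponding to a single point on the RD curve. Reproduction 
alphabets with optimal support are obtained by keeping $p(\hat{x}|x)$ 
fixed, and solving 

$$\min_{\hat{\mathcal{X}}_s} L_{GBRD} = \min_{\hat{\mathcal{X}}_s} \left\{ I_q(X; \hat{X}) + \beta \left\langle d_\phi(x,\hat{x}) \right\rangle_{p(x,\hat{x})} \right\}. \tag{20}$$

The solution to (20) is the optimal estimate/predictor [15, 16] 

$$\hat{x} = \langle x \rangle_{p(x|\hat{X}=\hat{x})} = \sum_x p(x|\hat{X}=\hat{x})\, x. \tag{21}$$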
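The support estimation step (21) is a posterior-weighted mean of the source points. A minimal sketch follows, assuming that $p(x|\hat{x})$ is obtained from $p(\hat{x}|x)$ and $p(x)$ by Bayes' rule; the function name and data are illustrative.

```python
def support_estimation(xs, px, cond):
    """Return x_hat_j = sum_i p(x_i | x_hat_j) x_i for every codeword j.

    xs   : list of source points (tuples of equal length)
    px   : source probabilities p(x_i)
    cond : cond[i][j] = p(x_hat_j | x_i)
    """
    n, k, m = len(xs), len(cond[0]), len(xs[0])
    codebook = []
    for j in range(k):
        # Marginal p(x_hat_j), the normalizer in Bayes' rule
        pj = sum(cond[i][j] * px[i] for i in range(n))
        codebook.append([sum(cond[i][j] * px[i] * xs[i][c] for i in range(n)) / pj
                         for c in range(m)])
    return codebook

# Hard assignments (assumed data): the update reduces to the cluster means
xs = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
px = [0.25, 0.25, 0.5]
cond = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_codebook = support_estimation(xs, px, cond)
print(new_codebook)
```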

Algorithm 1 depicts the pseudo-code for the greedy joint non- 
convex optimization procedure that constitutes the solution to 
the GBRD model. 

IV. THE TSALLIS-BREGMAN LOWER BOUND 

Theorem 1 For OT constraints, the nonextensive RD function 
with source $X \sim p(x)$ and a Bregman divergence $d_\phi$ is 
always lower bounded by the Tsallis-Bregman lower bound 
defined by 

R,L (D) = £ ^%y5 g (X) + 

r x } (22) 

+ sup <^ -(3D + (hv lL,f}* {x)) p{x) \ , 
where $\gamma_{L,\beta^*}$ is a unique function satisfying 

$$\int_{\mathrm{dom}(\phi)} p(t)\, \gamma_{L,\beta^*}(t) \exp_{q^*}\left( -\beta^* d_\phi(t,\mu) \right) dt = 1; \quad \forall \mu \in \mathrm{dom}(\phi). \tag{23}$$



To highlight the critical dependence of the Tsallis-Bregman 
lower bound upon the type of constraint employed to solve 
the GBRD variational problem in Section III B, and enunciate 
certain aspects of q-deformed algebra and calculus, a slightly 
weaker bound (see Appendix C, [16]) is stated and proved. 
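Theorem 2 The generalized RD function is defined by 

$$R_q(D) = \sup_{\beta \geq 0,\; \gamma_{\beta^*} \in \Lambda_{\beta^*}} \left\{ -\beta D + \int_x p(x) \ln_{q^*} \gamma_{\beta^*}(x)\, dx \right\}, \tag{24}$$

where $\Lambda_{\beta^*}$ is the set of all admissible partition functions. 
Further, for each $\beta \geq 0$, a necessary and sufficient condition 
for $\gamma_{\beta^*}(x)$ to attain the supremum in (24) is a probability 
density $p(\hat{x})$ related to $\gamma_{\beta^*}(x)$ as 

$$\gamma_{\beta^*}(x) = \left( \int_{\hat{x}} p(\hat{x}) \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right) d\hat{x} \right)^{-1}. \tag{25}$$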

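Proof 

$$\begin{aligned}
L_{GBRD}\left[ p(\hat{x}|x) \right]
&= -\int_{x,\hat{x}} p(x)\, p(\hat{x}|x) \ln_q\left( \frac{p(\hat{x})}{p(\hat{x}|x)} \right) dx\, d\hat{x} + \beta \int_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d_\phi(x,\hat{x})\, dx\, d\hat{x} \\
&\overset{(a)}{=} \int_{x,\hat{x}} p(x)\, p(\hat{x}|x) \ln_{q^*}\left( \frac{p(\hat{x}|x)}{p(\hat{x})} \right) dx\, d\hat{x} + \beta \int_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d_\phi(x,\hat{x})\, dx\, d\hat{x} \\
&\overset{(b)}{=} \int_{x,\hat{x}} p(x)\, p(\hat{x}|x) \ln_{q^*}\left( \frac{\exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right)}{Z(x,\beta^*)} \right) dx\, d\hat{x} + \beta \int_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d_\phi(x,\hat{x})\, dx\, d\hat{x} \\
&\overset{(c)}{=} -\int_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, Z^{(q^*-1)}(x,\beta^*) \left( \beta^* d_\phi(x,\hat{x}) + \ln_{q^*} Z(x,\beta^*) \right) dx\, d\hat{x} + \beta \int_{x,\hat{x}} p(x)\, p(\hat{x}|x)\, d_\phi(x,\hat{x})\, dx\, d\hat{x} \\
&\overset{(d)}{=} -\int_{x,\hat{x}} p(x)\, p(\hat{x}|x) \ln_q Z(x,\beta^*)\, dx\, d\hat{x} = -\int_x p(x) \ln_q Z(x,\beta^*)\, dx \\
&\overset{(e)}{=} -\int_x p(x) \ln_q\left( \int_{\hat{x}} p(\hat{x}) \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right) d\hat{x} \right) dx = L_{GBRD}\left[ p(\hat{x}) \right].
\end{aligned} \tag{26}$$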

Here, (a) follows from $\ln_q(x) = -\ln_{q^*}(1/x)$; $q^* = 2 - q$, 
(b) follows from (16) where the partition function in the 
continuum is $Z(x,\beta^*) = \int_{\hat{x}} p(\hat{x}) \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right) d\hat{x}$, (c) 
follows from (8), (d) follows from (15), and (e) follows from 
(17). Analogous to the mapping method of [14], a mapping 
from the unit interval to the reproduction alphabet is sought 
such that the Lebesgue measure over the unit interval results 
in an optimal $p(\hat{x})$ over the reproduction alphabet. Thus, for 
each instance of a probability measure corresponding to $p(\hat{x})$ 
on $\hat{\mathcal{X}}$ there exists a unique map $\hat{x} : [0,1] \to \hat{\mathcal{X}}$ that maps 
the Lebesgue measure ($\mu$) to $p(\hat{x})$ such that for any function 
$f$ defined over $\hat{x}$, $\int_{\hat{x}} f(\hat{x})\, p(\hat{x})\, d\hat{x} = \int_{u \in [0,1]} f(\hat{x}(u))\, d\mu(u)$. 
Thus, (26) yields 

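$$\begin{aligned}
L_{GBRD}\left[ \hat{x} \right]
&= -\int_x p(x) \ln_q\left( \int_{\hat{x}} p(\hat{x}) \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right) d\hat{x} \right) dx \\
&= -\int_x p(x) \ln_q\left( \int_{u \in [0,1]} \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}(u)) \right) d\mu(u) \right) dx.
\end{aligned} \tag{27}$$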

To obtain the optimality condition, the q-deformed terms 
characteristic to generalized statistics are to be operated on 
by the dual Jackson derivative instead of the usual Newtonian 
derivative [4]. To obtain expressions analogous to the case of 
Newton-Leibniz calculus, the dual Jackson derivative defined 
as 

$$D^{(q)} f(x) = \frac{1}{1 + (1-q)\, f(x)} \frac{df(x)}{dx} \;\Rightarrow\; D^{(q)}_{\hat{x}} \ln_q Z(x,\beta^*) = \frac{1}{Z(x,\beta^*)} \frac{dZ(x,\beta^*)}{d\hat{x}} \tag{28}$$

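is employed to enforce the optimality condition. The optimality 
condition for (27) becomes 

$$D^{(q)}_{\hat{x}(u)} L_{GBRD} = -\int_x p(x)\, \frac{\frac{d}{d\hat{x}(u)} \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}(u)) \right)}{\int_{u \in [0,1]} \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}(u)) \right) d\mu(u)}\, dx. \tag{29}$$

Invoking Borel isomorphism, and assuming that the optimal 
support $\hat{\mathcal{X}}_s$ contains a non-empty open ball $B_\epsilon$, $\epsilon > 0$, implies 
that 

$$\begin{aligned}
&\int_x \frac{p(x) \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right)}{Z(x,\beta^*)}\, dx = k_0; \quad \forall \hat{x} \in \hat{\mathcal{X}}_s, \\
&\int_{\hat{x} \in B_\epsilon} \int_x \frac{p(x)\, p(\hat{x}) \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right)}{Z(x,\beta^*)}\, dx\, d\hat{x} = k_0 \int_{\hat{x} \in B_\epsilon} p(\hat{x})\, d\hat{x}, \\
&\int_{\hat{x} \in B_\epsilon} \int_x p(x)\, p(\hat{x}|x)\, dx\, d\hat{x} = k_0 \int_{\hat{x} \in B_\epsilon} p(\hat{x})\, d\hat{x} \;\Rightarrow\; k_0 = 1,
\end{aligned} \tag{30}$$

holds true for all $\hat{x} \in B_\epsilon$. Defining $\gamma_{\beta^*}(x) = \left( \int_{\hat{x}} p(\hat{x}) \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right) d\hat{x} \right)^{-1}$, (30) yields 

$$\int_x p(x)\, \gamma_{\beta^*}(x) \exp_{q^*}\left( -\beta^* d_\phi(x,\hat{x}) \right) dx = 1; \quad \forall \hat{x} \in B_\epsilon. \tag{31}$$

Using the relation $-\ln_q(x) = \ln_{q^*}(1/x)$, (26) becomes 

$$L_{GBRD}\left[ p(\hat{x}) \right] = \int_x p(x) \ln_{q^*} \gamma_{\beta^*}(x)\, dx. \tag{32}$$

Thus, $\gamma_{\beta^*}(x)$ satisfies (25) and attains the supremum in (24) 
for a given $\beta$ and corresponding $\beta^*$ 

$$R_q(D_\beta) = -\beta D_\beta + \int_x p(x) \ln_{q^*} \gamma_{\beta^*}(x)\, dx, \tag{33}$$

where $D_\beta$ is the distortion value at which the supremum in 
(24) is attained for a given $\beta$. Thus, the Tsallis-Bregman lower 
bound is 

$$R_q^L(D_\beta) = -\beta D_\beta + \int_x p(x) \ln_{q^*} \gamma_{\beta^*}(x)\, dx = R_q(D_\beta). \tag{34}$$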
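The q-deformed identities used in steps (a) and (d) of the proof, and the action of the dual Jackson derivative on the q-logarithm, can be checked numerically. The sketch below is illustrative; the central finite difference and the test values are assumptions, not part of the original derivation.

```python
def ln_q(x, q):
    """q-deformed logarithm for q != 1."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def dual_jackson(f, x, q, h=1e-6):
    """Dual Jackson derivative: (df/dx) / (1 + (1 - q) f(x)), via a central difference."""
    dfdx = (f(x + h) - f(x - h)) / (2.0 * h)
    return dfdx / (1.0 + (1.0 - q) * f(x))

# Assumed test values
q = 0.6
q_star = 2.0 - q
z = 1.7

# Step (a): ln_q(z) = -ln_{q*}(1 / z)
duality_gap = ln_q(z, q) + ln_q(1.0 / z, q_star)

# Step (d): Z^(q* - 1) ln_{q*} Z = ln_q Z
rescaling_gap = z ** (q_star - 1.0) * ln_q(z, q_star) - ln_q(z, q)

# Dual Jackson derivative of ln_q gives 1 / z, as in the Newton-Leibniz case
jackson_value = dual_jackson(lambda t: ln_q(t, q), z, q)

print(duality_gap, rescaling_gap, jackson_value, 1.0 / z)
```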

V. Numerical simulations 

The efficacy of the GBRD model is computationally investigated 
by drawing a sample of 1000 two-dimensional data 
points from three spherical Gaussian distributions with centers 
(2, 3.5), (0, 0), (0, 2) (the quantized codebook). The priors 
and standard deviations are 0.3, 0.4, 0.3, and 0.2, 0.5, 1.0, 
respectively. To test the effectiveness of the support estimation 
step, the quantized codebook is shifted from the true means to 
positions at the edges of the spherical Gaussian distributions. 
The computational procedure described in Algorithm 1 is 
repeated for each value of $\beta^*$ until a reproduction 
alphabet with optimal support is obtained. This consistently 
coincides with the true means, to within a negligible error. This effect 
is particularly pronounced, and rapidly achieved, for regions 
of low and intermediate values of $\beta^*$, thus providing implicit 
proof of the relation between soft clustering and RD with 
Bregman divergences [15]. 
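A sample of the kind described above can be drawn as sketched below. The centers, priors and standard deviations are those quoted in the text; the random seed, the function names and the particular way of shifting the initial codebook toward the edges (one standard deviation along the first axis) are assumptions made for illustration.

```python
import random

CENTERS = [(2.0, 3.5), (0.0, 0.0), (0.0, 2.0)]
PRIORS = [0.3, 0.4, 0.3]
SIGMAS = [0.2, 0.5, 1.0]

def sample_mixture(n=1000, seed=0):
    """Draw n two-dimensional points from the three spherical Gaussians."""
    rng = random.Random(seed)
    labels = rng.choices(range(3), weights=PRIORS, k=n)
    points = [(rng.gauss(CENTERS[c][0], SIGMAS[c]),
               rng.gauss(CENTERS[c][1], SIGMAS[c])) for c in labels]
    return points, labels

def shifted_codebook(scale=1.0):
    """Hypothetical edge initialization: move each center by scale * sigma along x."""
    return [(cx + scale * s, cy) for (cx, cy), s in zip(CENTERS, SIGMAS)]

points, labels = sample_mixture()
print(len(points), shifted_codebook())
```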

Fig. 1 depicts the RD curves for extensive RD with Bregman 
divergences [15, 16] and the GBRD model, with the constituent 
discrete points overlaid upon them. A Euclidean square 
distortion (a Bregman divergence) is employed. Each curve 
has been generated for values of $\beta \in [0.1, 2.5]$ (the extensive 
case), and $\beta^* \in [0.1, 100]$ (the GBRD cases), respectively. Note 
that for the GBRD cases, the slope of the RD curve is $-\beta$ 
and not $-\beta^*$. Note that all GBRD curves inhabit the non-achievable 
(no compression) region of the extensive RD model 
with Bregman divergences. Further, GBRD models possessing 
a lower nonextensivity parameter q inhabit the non-achievable 
regions of GBRD models possessing a higher value of q. It 



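Algorithm 1 GBRD Model 

Input 

1. $X \sim p(x)$ over $\{x_i\}_{i=1}^n \subset \mathrm{dom}(\phi) \subseteq \mathbb{R}^m$. 

2. Bregman divergence $d_\phi$, $|\hat{\mathcal{X}}_s| = k$, effective variational 
parameter $\beta^* \in [0, \infty]$; each $\beta^*$ is a single point on the RD 
curve. 

Output 

1. $\hat{\mathcal{X}}_s^* = \{\hat{x}_j\}_{j=1}^k$, $p^* = \{\{p(\hat{x}_j|x_i)\}_{j=1}^k\}_{i=1}^n$ that locally 
optimizes (11), $(R_\beta, D_\beta)$ tradeoff at each $\beta^*$. 

2. Value of $R_q(D)$ where its slope equals 
$-\beta = \frac{-q\, \aleph_q(x)\, \beta^*}{1 - \beta^* (q-1) \langle d_\phi(x,\hat{x}) \rangle_{p(\hat{x}|x)}}$. 

Method 

1. Initialize with some $\{\hat{x}_j\}_{j=1}^k \subset \mathrm{dom}(\phi)$. 
2. Set up outer $\beta^*$ loop. 
repeat 
    Blahut-Arimoto loop (16) for single value of $\beta^*$ 
    repeat 
        for i = 1 to n do 
            for j = 1 to k do 
                $p(\hat{x}_j|x_i) \leftarrow \frac{p(\hat{x}_j) \exp_{q^*}\left( -\beta^* d_\phi(x_i,\hat{x}_j) \right)}{Z(x_i,\beta^*)}$ 
            end for 
        end for 
        for j = 1 to k do 
            $p(\hat{x}_j) \leftarrow \sum_{i=1}^n p(\hat{x}_j|x_i)\, p(x_i)$ 
        end for 
    until convergence 
    Support Estimation Step ($\hat{\mathcal{X}}_s$ using (21)) 
    for j = 1 to k do 
        $\hat{x}_j \leftarrow \sum_{i=1}^n p(x_i|\hat{x}_j)\, x_i$ 
    end for 
until convergence 
Calculate $D_\beta$ and $R_\beta$ for the $\beta^*$. 
advance $\beta^*$: $\beta^* = \beta^* + \delta\beta^*$ 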
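The inner loops of Algorithm 1 for a single value of $\beta^*$ can be sketched in plain Python as below. This is an illustrative sketch rather than the original implementation: the uniform initial prior, the convergence tolerances, the iteration caps, the squared Euclidean distortion and the toy data are all assumptions.

```python
import math

def exp_q(x, q):
    """q-deformed exponential with the Tsallis cut-off."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

def sq_euclid(a, b):
    """Squared Euclidean distortion, a Bregman divergence."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def gbrd_point(xs, px, codebook, beta_star, q, d=sq_euclid, iters=200, tol=1e-9):
    """Alternate the BA loop (16) and the support estimation step (21) for one beta*."""
    q_star = 2.0 - q
    n, k, m = len(xs), len(codebook), len(xs[0])
    codebook = [list(c) for c in codebook]
    prior = [1.0 / k] * k                        # p(x_hat_j); uniform start (assumption)
    cond = []
    for _ in range(iters):                       # outer loop over support updates
        for _ in range(iters):                   # Blahut-Arimoto loop
            cond = []
            for x in xs:
                w = [prior[j] * exp_q(-beta_star * d(x, codebook[j]), q_star)
                     for j in range(k)]
                z = sum(w)                       # partition function (17)
                cond.append([wj / z for wj in w])
            new_prior = [sum(cond[i][j] * px[i] for i in range(n)) for j in range(k)]
            shift = max(abs(a - b) for a, b in zip(prior, new_prior))
            prior = new_prior
            if shift < tol:
                break
        moved = 0.0
        for j in range(k):                       # support estimation step (21)
            if prior[j] <= 0.0:
                continue
            new_c = [sum(cond[i][j] * px[i] * xs[i][c] for i in range(n)) / prior[j]
                     for c in range(m)]
            moved = max(moved, d(new_c, codebook[j]))
            codebook[j] = new_c
        if moved < tol:
            break
    return codebook, prior, cond

# Toy data (assumed): two well-separated one-dimensional clusters
xs = [(0.0,), (0.2,), (5.0,), (5.2,)]
px = [0.25] * 4
codebook, prior, cond = gbrd_point(xs, px, [(1.0,), (4.0,)], beta_star=50.0, q=0.5)
print(codebook, prior)
```

Sweeping `beta_star` over a grid and evaluating the rate and distortion at each converged solution would trace out the RD curve, with the slope obtained from (19).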



is observed that the GBRD model undergoes compression 
and clustering more rapidly than the equivalent extensive RD 
model with Bregman divergences. A primary cause for such 
behavior is the rapid increase in $\beta^*$ for marginal increases in 
$\beta$, as depicted in Fig. 2 and obtained from (19). 
Fig. 1. Rate Distortion Curves for GBRD Model and Extensive RD with 
Bregman Divergences 
Fig. 2. Curves for $\beta$ v/s $\beta^*$ for GBRD Model 



